Assignment 3: MoveHub City Rankings

Author

Bablu Prasad (23104917)

Introduction:

MoveHub knows that everyone has different reasons for moving. That’s why they have ranked some of the world’s top cities on key factors such as safety, property affordability and climate. Simply select the factors most important to you, and they will show you the best cities to consider!

For more details, see the MoveHub city rankings page.

We have three files in total, downloaded from Kaggle:

  • movehubqualityoflife.csv

  • movehubcostofliving.csv

  • cities.csv (LAT & LONG merged in from Downloadable Geographic Databases | Simplemaps.com)

Flow Diagram of Dataset:

I merged all three datasets into a single dataset (combined_data) in a later step. The fourth dataset, which contained the latitude and longitude data, was merged with cities.csv in Excel beforehand.

DiagrammeR::mermaid('
%%{init: {"flowchart": {"htmlLabels": false}} }%%
  graph BT
     B[<b>movehubcostofliving.csv</b>] -->|City| A(<b>Combined_data</b> - contains details of all 3 datasets)
    C[<b>movehubqualityoflife.csv</b>] -->|City| A
    D[<b>read.csv</b>] -->|City| A
    E[<b>worldcities.xlsx</b>] -->|City| D
   
   style A fill:#f9f,stroke:#333,stroke-width:2px,font-weight:bold,font-style:italic;
')
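As an aside, the Excel merge of coordinates into cities.csv could also be scripted in R. This is a minimal sketch with dplyr, assuming the Simplemaps file exposes `city`, `lat` and `lng` columns (toy data stand in for the real files):

```r
library(dplyr)

# Toy stand-ins for cities.csv and the Simplemaps worldcities file
cities_raw  <- data.frame(City = c("Oslo", "Perth"))
worldcities <- data.frame(city = c("Oslo", "Perth"),
                          lat  = c(59.9133, -31.9559),
                          lng  = c(10.7389, 115.8606))

# Left join keeps every city, attaching coordinates where a match exists
cities_geo <- cities_raw %>%
  left_join(worldcities, by = c("City" = "city")) %>%
  rename(LAT = lat, LONG = lng)
```

A `left_join` (rather than `inner_join`) preserves cities with no coordinate match, which can then be inspected before deciding how to handle them.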

Flow of the process performed:


Importing the Required Libraries:

# Keep conflict warnings visible when loading libraries 
options(warn.conflicts = TRUE) 
# Load libraries with other load-time warnings suppressed 
suppressWarnings({ 
  library(tidyverse, exclude = c("annotate", "combine"))  # For data manipulation and visualization
  library(RColorBrewer)  # For color palettes
  library(tm)  # For text mining and preprocessing
  library(wordcloud)  # For creating word clouds
  library(gridExtra)  # For arranging plots
  library(leaflet)  # For interactive maps
  library(corrplot)  # For visualizing correlation matrices
  library(caTools)  # For data splitting
  library(DiagrammeR)
  library(MASS, exclude = c("dplyr"))  # For statistical functions
}) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: NLP


Attaching package: 'NLP'


The following object is masked from 'package:ggplot2':

    annotate



Attaching package: 'gridExtra'


The following object is masked from 'package:dplyr':

    combine


corrplot 0.92 loaded


Attaching package: 'MASS'


The following object is masked from 'package:dplyr':

    select
# Reset the option to its default 
options(warn.conflicts = NULL)

Loading all three datasets:

cost_of_living <- read.csv("E:\\Assigment3\\movehubcostofliving.csv") 
quality_of_life <- read.csv("E:\\Assigment3\\movehubqualityoflife.csv") 
cities <- read.csv("E:\\Assigment3\\cities.csv")

Dataset Understanding:

# Display all dataframes 
head(cost_of_living)
       City Cappuccino Cinema  Wine Gasoline Avg.Rent Avg.Disposable.Income
1  Lausanne       3.15  12.59  8.40     1.32  1714.00               4266.11
2    Zurich       3.28  12.59  8.40     1.31  2378.61               4197.55
3    Geneva       2.80  12.94 10.49     1.28  2607.95               3917.72
4     Basel       3.50  11.89  7.35     1.25  1649.29               3847.76
5     Perth       2.87  11.43 10.08     0.97  2083.14               3358.55
6 Nashville       3.84  12.00 13.50     0.65  2257.14               3089.75
head(quality_of_life)
          City Movehub.Rating Purchase.Power Health.Care Pollution
1      Caracas          65.18          11.25       44.44     83.45
2 Johannesburg          84.08          53.99       59.98     47.39
3    Fortaleza          80.17          52.28       45.46     66.32
4  Saint Louis          85.25          80.40       77.29     31.33
5  Mexico City          75.07          24.28       61.76     18.95
6      Detroit          70.63          73.81       63.05     83.45
  Quality.of.Life Crime.Rating
1            8.61        85.70
2           51.26        83.93
3           36.68        78.65
4           87.51        78.13
5           27.91        77.86
6           50.99        76.69
head(cities)
              City       Country     LAT      LONG
1          Oakland United States 37.7904 -122.2166
2         Oakville        Canada   43.45  -79.6833
3 Oaxaca de Juárez        Mexico 17.0606  -96.7253
4       Oberhausen       Germany 51.4967    6.8706
5          Obihiro         Japan 42.9167     143.2
6          Obninsk        Russia 55.0931   36.6106
# Checking datatype 
print("Datatype of 'cost_of_living' dataset:")
[1] "Datatype of 'cost_of_living' dataset:"
str(cost_of_living)
'data.frame':   216 obs. of  7 variables:
 $ City                 : chr  "Lausanne" "Zurich" "Geneva" "Basel" ...
 $ Cappuccino           : num  3.15 3.28 2.8 3.5 2.87 3.84 2.35 3.92 2.13 4.48 ...
 $ Cinema               : num  12.6 12.6 12.9 11.9 11.4 ...
 $ Wine                 : num  8.4 8.4 10.49 7.35 10.08 ...
 $ Gasoline             : num  1.32 1.31 1.28 1.25 0.97 0.65 0.99 1.57 1.15 1.68 ...
 $ Avg.Rent             : num  1714 2379 2608 1649 2083 ...
 $ Avg.Disposable.Income: num  4266 4198 3918 3848 3359 ...
print("Datatype of 'quality_of_life' dataset:")
[1] "Datatype of 'quality_of_life' dataset:"
str(quality_of_life)
'data.frame':   216 obs. of  7 variables:
 $ City           : chr  "Caracas" "Johannesburg" "Fortaleza" "Saint Louis" ...
 $ Movehub.Rating : num  65.2 84.1 80.2 85.2 75.1 ...
 $ Purchase.Power : num  11.2 54 52.3 80.4 24.3 ...
 $ Health.Care    : num  44.4 60 45.5 77.3 61.8 ...
 $ Pollution      : num  83.5 47.4 66.3 31.3 18.9 ...
 $ Quality.of.Life: num  8.61 51.26 36.68 87.51 27.91 ...
 $ Crime.Rating   : num  85.7 83.9 78.7 78.1 77.9 ...
print("Datatype of 'cities' dataset:")
[1] "Datatype of 'cities' dataset:"
str(cities)
'data.frame':   3543 obs. of  4 variables:
 $ City   : chr  "Oakland" "Oakville" "Oaxaca de Juárez" "Oberhausen" ...
 $ Country: chr  "United States" "Canada" "Mexico" "Germany" ...
 $ LAT    : chr  "37.7904" "43.45" "17.0606" "51.4967" ...
 $ LONG   : chr  "-122.2166" "-79.6833" "-96.7253" "6.8706" ...
# Ensuring city names are consistent across datasets 
cost_of_living$City <- tolower(cost_of_living$City)
quality_of_life$City <- tolower(quality_of_life$City)
cities$City <- tolower(cities$City)
# Display the dimensions of datasets 
print("Dimension of row-columns of 'cities' dataset:")
[1] "Dimension of row-columns of 'cities' dataset:"
print(dim(cities))
[1] 3543    4
print("Dimension of row-columns of 'cost of living' dataset:")
[1] "Dimension of row-columns of 'cost of living' dataset:"
print(dim(cost_of_living))
[1] 216   7
print("Dimension of row-columns of 'quality of life' dataset:")
[1] "Dimension of row-columns of 'quality of life' dataset:"
print(dim(quality_of_life))
[1] 216   7
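Beyond the lower-casing above, trimming stray whitespace is a cheap extra guard against silent join misses; a small sketch (base R only):

```r
# Hypothetical helper: normalise city names before joining.
# Lower-case and strip leading/trailing whitespace.
normalize_city <- function(x) trimws(tolower(x))

normalize_city(c("  Zurich", "GENEVA "))
#> [1] "zurich" "geneva"
```

Applied as `cities$City <- normalize_city(cities$City)` (and likewise for the other two frames), it would replace the three `tolower()` lines.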

Merging the datasets on City:

# Merge the datasets on the 'City' column 
combined_data <- cost_of_living %>%
  inner_join(quality_of_life, by = "City") %>%
  inner_join(cities, by = "City")
# displaying merged dataframe
head(combined_data)
       City Cappuccino Cinema  Wine Gasoline Avg.Rent Avg.Disposable.Income
1  lausanne       3.15  12.59  8.40     1.32  1714.00               4266.11
2    geneva       2.80  12.94 10.49     1.28  2607.95               3917.72
3     basel       3.50  11.89  7.35     1.25  1649.29               3847.76
4     perth       2.87  11.43 10.08     0.97  2083.14               3358.55
5 nashville       3.84  12.00 13.50     0.65  2257.14               3089.75
6  canberra       2.35  11.42 10.08     0.99  1984.74               3023.91
  Movehub.Rating Purchase.Power Health.Care Pollution Quality.of.Life
1          87.21          90.77       65.85     87.62           73.21
2          83.27          61.22       74.88     29.43           82.76
3          84.20          78.17       79.74     59.18           88.27
4          95.38          62.11       80.56     23.53           74.62
5          80.61          80.30       60.30      0.00           80.50
6          83.23          63.26       91.90     11.48           93.05
  Crime.Rating       Country      LAT     LONG
1        35.55   Switzerland  46.5198   6.6335
2        54.36   Switzerland  46.2017   6.1469
3        28.12   Switzerland  47.5547   7.5906
4        50.01     Australia -31.9559 115.8606
5        25.50 United States  36.1715 -86.7842
6        40.36     Australia -35.2931 149.1269
# Checking merged dataframe datatype 
print("Datatype of 'combined_data' dataset:")
[1] "Datatype of 'combined_data' dataset:"
str(combined_data)
'data.frame':   201 obs. of  16 variables:
 $ City                 : chr  "lausanne" "geneva" "basel" "perth" ...
 $ Cappuccino           : num  3.15 2.8 3.5 2.87 3.84 2.35 3.92 2.13 4.48 2.49 ...
 $ Cinema               : num  12.6 12.9 11.9 11.4 12 ...
 $ Wine                 : num  8.4 10.49 7.35 10.08 13.5 ...
 $ Gasoline             : num  1.32 1.28 1.25 0.97 0.65 0.99 1.57 1.15 1.68 0.95 ...
 $ Avg.Rent             : num  1714 2608 1649 2083 2257 ...
 $ Avg.Disposable.Income: num  4266 3918 3848 3359 3090 ...
 $ Movehub.Rating       : num  87.2 83.3 84.2 95.4 80.6 ...
 $ Purchase.Power       : num  90.8 61.2 78.2 62.1 80.3 ...
 $ Health.Care          : num  65.8 74.9 79.7 80.6 60.3 ...
 $ Pollution            : num  87.6 29.4 59.2 23.5 0 ...
 $ Quality.of.Life      : num  73.2 82.8 88.3 74.6 80.5 ...
 $ Crime.Rating         : num  35.5 54.4 28.1 50 25.5 ...
 $ Country              : chr  "Switzerland" "Switzerland" "Switzerland" "Australia" ...
 $ LAT                  : chr  "46.5198" "46.2017" "47.5547" "-31.9559" ...
 $ LONG                 : chr  "6.6335" "6.1469" "7.5906" "115.8606" ...
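Note that the inner joins shrank the data from 216 to 201 rows: cities absent from any one file are dropped. `dplyr::anti_join` can list exactly which rows were lost; a toy illustration (the real call would be `anti_join(cost_of_living, cities, by = "City")`):

```r
library(dplyr)

# Toy stand-ins: one city has no coordinate record
cost <- data.frame(City = c("oslo", "perth", "atlantis"))
geo  <- data.frame(City = c("oslo", "perth"))

# Rows of `cost` with no match in `geo`
dropped <- anti_join(cost, geo, by = "City")
dropped$City
#> [1] "atlantis"
```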

Data Preprocessing:

# Convert LAT and LONG columns to numeric if they aren't already 
combined_data$LAT <- as.numeric(as.character(combined_data$LAT))
combined_data$LONG <- as.numeric(as.character(combined_data$LONG))
# Replace "N/A" strings with NA and check for missing values 
combined_data[combined_data == "N/A"] <- NA 
colSums(is.na(combined_data)) 
                 City            Cappuccino                Cinema 
                    0                     0                     0 
                 Wine              Gasoline              Avg.Rent 
                    0                     0                     0 
Avg.Disposable.Income        Movehub.Rating        Purchase.Power 
                    0                     0                     0 
          Health.Care             Pollution       Quality.of.Life 
                    0                     0                     0 
         Crime.Rating               Country                   LAT 
                    0                     0                     0 
                 LONG 
                    0 
# Define numerical and categorical columns 
numerical_columns <- names(combined_data)[sapply(combined_data, is.numeric)]
categorical_columns <- names(combined_data)[sapply(combined_data, function(x) is.factor(x) | is.character(x))]
# Summary statistics 
summary(combined_data)
     City             Cappuccino        Cinema            Wine       
 Length:201         Min.   :0.460   Min.   : 1.810   Min.   : 2.130  
 Class :character   1st Qu.:1.310   1st Qu.: 4.270   1st Qu.: 4.260  
 Mode  :character   Median :2.040   Median : 6.820   Median : 6.530  
                    Mean   :1.966   Mean   : 6.829   Mean   : 7.103  
                    3rd Qu.:2.490   3rd Qu.: 7.970   3rd Qu.: 8.400  
                    Max.   :4.480   Max.   :79.490   Max.   :26.150  
    Gasoline        Avg.Rent      Avg.Disposable.Income Movehub.Rating  
 Min.   :0.070   Min.   : 120.7   Min.   : 120.7        Min.   : 59.88  
 1st Qu.:0.800   1st Qu.: 605.6   1st Qu.: 556.7        1st Qu.: 75.32  
 Median :1.000   Median : 980.6   Median :1460.5        Median : 80.85  
 Mean   :1.025   Mean   :1089.0   Mean   :1384.9        Mean   : 79.72  
 3rd Qu.:1.350   3rd Qu.:1382.3   3rd Qu.:2013.6        3rd Qu.: 83.55  
 Max.   :1.690   Max.   :5052.3   Max.   :4266.1        Max.   :100.00  
 Purchase.Power   Health.Care      Pollution     Quality.of.Life
 Min.   : 6.38   Min.   :20.83   Min.   : 0.00   Min.   : 5.29  
 1st Qu.:28.83   1st Qu.:58.90   1st Qu.:24.54   1st Qu.:42.65  
 Median :47.98   Median :67.88   Median :39.16   Median :62.82  
 Mean   :45.41   Mean   :66.31   Mean   :46.25   Mean   :58.76  
 3rd Qu.:58.77   3rd Qu.:77.31   3rd Qu.:68.21   3rd Qu.:77.07  
 Max.   :91.85   Max.   :95.96   Max.   :92.42   Max.   :93.05  
  Crime.Rating     Country               LAT              LONG        
 Min.   :10.86   Length:201         Min.   :-43.53   Min.   :-157.85  
 1st Qu.:28.91   Class :character   1st Qu.: 22.30   1st Qu.:  -9.15  
 Median :41.32   Mode  :character   Median : 40.40   Median :  10.00  
 Mean   :41.38                      Mean   : 30.38   Mean   :  10.50  
 3rd Qu.:50.38                      3rd Qu.: 48.78   3rd Qu.:  46.72  
 Max.   :85.70                      Max.   : 63.43   Max.   : 174.78  
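The summary hints at outliers (e.g. Cinema’s maximum of 79.49 against a 3rd quartile of 7.97). A standard 1.5×IQR flag, shown here as a self-contained sketch, could screen such values before modelling:

```r
# Hypothetical helper: flag values outside the 1.5*IQR whiskers
flag_outliers <- function(x) {
  q   <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Toy values mimicking the Cinema column's shape
flag_outliers(c(4.3, 6.8, 8.0, 79.5))
#> [1] FALSE FALSE FALSE  TRUE
```

Whether to drop, cap, or keep flagged rows is a modelling decision; the data here are left untouched.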

Exploratory Data Analysis:

1. Count of Unique Values in Each Categorical Column:

# Count of Unique Values in Each Categorical Column
unique_counts <- sapply(combined_data[sapply(combined_data, is.character)], function(x) length(unique(x)))

# Plotting the counts as a pie chart
pie(unique_counts, main = "Count of Unique Values in Each Categorical Column", col = rainbow(length(unique_counts)),
    labels = paste(names(unique_counts), ": ", unique_counts), cex = 0.8)

2. Word clouds of countries & cities in the dataset:

# Create a corpus from the "Country" column
country_corpus <- Corpus(VectorSource(combined_data$Country))

# Preprocess the text for country names (remove punctuation, convert to lowercase, remove stopwords)
suppressWarnings({
  country_corpus <- tm_map(country_corpus, content_transformer(tolower))
  country_corpus <- tm_map(country_corpus, removePunctuation)
  country_corpus <- tm_map(country_corpus, removeNumbers)
  country_corpus <- tm_map(country_corpus, removeWords, stopwords("english"))
})

# Define a custom function to remove empty documents from the corpus
removeEmptyDocs <- function(corpus) {
  empty_docs <- which(sapply(corpus, function(x) length(unlist(strsplit(as.character(x), " ")))) == 0)
  if (length(empty_docs) > 0) {
    corpus <- corpus[-empty_docs]
  }
  return(corpus)
}

# Remove empty documents from the corpus
suppressWarnings({
  country_corpus <- removeEmptyDocs(country_corpus)
})

# Convert corpus to a plain text document
country_text <- tm_map(country_corpus, content_transformer(as.character))
Warning in tm_map.SimpleCorpus(country_corpus,
content_transformer(as.character)): transformation drops documents
# Generate word cloud for countries with adjustments for long country names
wordcloud(country_text, min.freq = 1, random.order = FALSE, colors = brewer.pal(8, "Dark2"), 
          max.words = 100, scale=c(4, 0.5))

# Create a corpus from the "City" column
city_corpus <- Corpus(VectorSource(combined_data$City))

# Preprocess the text for city names (remove punctuation, convert to lowercase, remove stopwords)
suppressWarnings({
  city_corpus <- tm_map(city_corpus, content_transformer(tolower))
  city_corpus <- tm_map(city_corpus, removePunctuation)
  city_corpus <- tm_map(city_corpus, removeNumbers)
  city_corpus <- tm_map(city_corpus, removeWords, stopwords("english"))
})

# Define a custom function to remove empty documents from the corpus
removeEmptyDocs <- function(corpus) {
  empty_docs <- which(sapply(corpus, function(x) length(unlist(strsplit(as.character(x), " ")))) == 0)
  if (length(empty_docs) > 0) {
    corpus <- corpus[-empty_docs]
  }
  return(corpus)
}

# Remove empty documents from the corpus
suppressWarnings({
  city_corpus <- removeEmptyDocs(city_corpus)
})

# Convert corpus to a plain text document
city_text <- tm_map(city_corpus, content_transformer(as.character))
Warning in tm_map.SimpleCorpus(city_corpus, content_transformer(as.character)):
transformation drops documents
# Generate word cloud for cities with adjustments for long city names
wordcloud(city_text, min.freq = 1, random.order = FALSE, colors = brewer.pal(8, "Dark2"), 
          max.words = 50, scale=c(2, 0.5))
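Since country and city names are short, single labels, the tm pipeline above is arguably heavier than needed; a plain frequency table can feed `wordcloud()` directly. A sketch with toy data (the real input would be `combined_data$Country`):

```r
# Toy stand-in for combined_data$Country
countries <- c("Switzerland", "Switzerland", "Australia", "United States")

# One row per label with its count -- no corpus needed
freqs <- as.data.frame(table(word = countries))
freqs

# wordcloud(words = freqs$word, freq = freqs$Freq, min.freq = 1,
#           colors = brewer.pal(8, "Dark2"))
```

This also sidesteps the "transformation drops documents" warnings, which arise because multi-word names like "united states" are split and partially removed by the tm transformations.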

3. Histograms of the distributions of numerical columns:

# Loop through each numerical column to create histograms
for(column_name in numerical_columns) {
  # Print the name of the current column
  print(column_name)
  
  # Create a symbol from the string column name
  column_sym <- sym(column_name)
  
  # Plot histogram for the current numerical column using tidy evaluation
  p <- ggplot(combined_data, aes(x = !!column_sym)) + 
    geom_histogram(binwidth = 1, fill = "blue", color = "black") + 
    theme_minimal() + 
    labs(title = paste("Histogram of", column_name))
  
  # This will actually draw the plot
  print(p)
}
[1] "Cappuccino"

[1] "Cinema"

[1] "Wine"

[1] "Gasoline"

[1] "Avg.Rent"

[1] "Avg.Disposable.Income"

[1] "Movehub.Rating"

[1] "Purchase.Power"

[1] "Health.Care"

[1] "Pollution"

[1] "Quality.of.Life"

[1] "Crime.Rating"

[1] "LAT"

[1] "LONG"

4. Are there any significant differences in Cappuccino prices across different countries?

# Filter the data to include only the top 10 countries by Cappuccino prices
top_10_countries <- combined_data %>%
  group_by(Country) %>%
  summarize(Avg_Cappuccino_Price = mean(Cappuccino)) %>%
  top_n(10, Avg_Cappuccino_Price) %>%
  pull(Country)

filtered_data <- combined_data %>%
  filter(Country %in% top_10_countries)

# Create a boxplot to visualize the distribution of Cappuccino prices across the top 10 countries
ggplot(filtered_data, aes(x = Country, y = Cappuccino)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Cappuccino Prices Across Top 10 Countries",
       x = "Country",
       y = "Cappuccino Price (USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

# Perform ANOVA to test for significant differences in Cappuccino prices across the top 10 countries
anova_result <- aov(Cappuccino ~ Country, data = filtered_data)
summary(anova_result)
            Df Sum Sq Mean Sq F value Pr(>F)  
Country      9  3.526  0.3917   3.657 0.0278 *
Residuals   10  1.071  0.1071                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Yes — the ANOVA p-value (0.0278) is below 0.05, so there are significant differences in Cappuccino prices across the top 10 countries.
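A natural follow-up (not run on the real data here) is Tukey’s HSD, which reports which country pairs differ rather than just that some difference exists; sketched on toy data:

```r
# Toy two-group stand-in for filtered_data (price ~ country)
set.seed(101)
toy <- data.frame(price   = c(rnorm(6, mean = 2), rnorm(6, mean = 4)),
                  country = rep(c("A", "B"), each = 6))

# Pairwise comparisons with family-wise error control
tukey <- TukeyHSD(aov(price ~ country, data = toy))
tukey
```

On the real data, `TukeyHSD(anova_result)` would give one confidence interval per country pair.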

5. Geographical distribution of top 10 cities with high Movehub Ratings:

# Order cities by Movehub Rating and select top 10
top_10_cities <- combined_data[order(combined_data$Movehub.Rating, decreasing = TRUE), ][1:10, ]

# Create Leaflet map
m <- leaflet(top_10_cities) %>% 
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addCircleMarkers(
    lng = ~LONG, lat = ~LAT, 
    radius = ~sqrt(Movehub.Rating) * 3,  # Adjust circle marker size based on Movehub Rating
    color = "#4CAF50", fillColor = "#4CAF50", fillOpacity = 0.7,
    popup = ~paste("<b>", City, "</b>", "<br>", "<b>Movehub Rating:</b> ", Movehub.Rating)
  ) %>%
  addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addScaleBar(position = "bottomleft") %>%
  addLegend(
    position = "bottomright",
    colors = "#4CAF50",
    labels = "Circle Radius (Movehub Rating)"
  ) %>%
  addEasyButton(easyButton(
    icon = "fa-info-circle", title = "Information",
    onClick = JS("function(btn, map) { alert('This map displays the top 10 cities based on Movehub Rating, with circle markers representing Movehub Rating.'); }")
  ))

# Display the map
m 

6. Geographical distribution of top 10 cities by Quality of Life:

# Order cities by Quality of Life and select top 10
top_10_cities <- combined_data[order(combined_data$Quality.of.Life, decreasing = TRUE), ][1:10, ]

# Create Leaflet map
m <- leaflet(top_10_cities) %>% 
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addCircleMarkers(
    lng = ~LONG, lat = ~LAT, 
    radius = ~sqrt(Quality.of.Life) * 3,  # Adjust circle marker size based on Quality of Life
    color = "#2196F3", fillColor = "#2196F3", fillOpacity = 0.7,
    popup = ~paste("<b>", City, "</b>", "<br>", "<b>Quality of Life:</b> ", Quality.of.Life)
  ) %>%
  addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addScaleBar(position = "bottomleft") %>%
  addLegend(
    position = "bottomright",
    colors = "#2196F3",
    labels = "Circle Radius (Quality of Life)"
  ) %>%
  addEasyButton(easyButton(
    icon = "fa-info-circle", title = "Information",
    onClick = JS("function(btn, map) { alert('This map displays the top 10 cities based on Quality of Life, with circle markers representing Quality of Life.'); }")
  ))

# Display the map
m 

7. Geographical distribution of the 10 cities with the lowest Crime Rating:

# Order cities by Crime Rating (ascending) and select the 10 safest
top_10_cities <- combined_data[order(combined_data$Crime.Rating), ][1:10, ]

# Create Leaflet map
m <- leaflet(top_10_cities) %>% 
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addCircleMarkers(
    lng = ~LONG, lat = ~LAT, 
    radius = ~sqrt(Crime.Rating) * 3,  # Adjust circle marker size based on Crime Rating
    color = "#FF5722", fillColor = "#FF5722", fillOpacity = 0.7,
    popup = ~paste("<b>", City, "</b>", "<br>", "<b>Crime Rating:</b> ", Crime.Rating)
  ) %>%
  addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addScaleBar(position = "bottomleft") %>%
  addLegend(
    position = "bottomright",
    colors = "#FF5722",
    labels = "Circle Radius (Crime Rating)"
  ) %>%
  addEasyButton(easyButton(
    icon = "fa-info-circle", title = "Information",
    onClick = JS("function(btn, map) { alert('This map displays the top 10 cities with the lowest Crime Rating, with circle markers representing Crime Rating.'); }")
  ))

# Display the map
m 
# Creating a new dataframe for modelling 
res <- combined_data[complete.cases(combined_data$LAT, combined_data$LONG), ] 

Overall correlation:

# Overall correlation
num.cols <- sapply(res, is.numeric)
corrPLOT <- corrplot(cor(res[,num.cols]), method='ellipse', order="AOE")

  • Movehub Rating is negatively correlated with Pollution and Crime, as one would expect
  • The other features are positively correlated with Movehub Rating

(better) Correlation:

# (better) Correlation
res <- data.frame(res %>% dplyr::select(-Quality.of.Life,-City,-Country))
num.cols <- sapply(res, is.numeric)
corrPLOT <- corrplot(cor(res[,num.cols]), method='number', order="AOE") 

  • The features most strongly correlated with Movehub.Rating are Purchase.Power and Avg.Disposable.Income
  • Gasoline does not show a strong correlation (to my surprise)
  • Cappuccino (the average price, I assume) is almost as important as Health.Care.

Overall, there are linear correlations between Movehub Rating and the other numerical features, so a linear model is a reasonable first test.

Applying Models to Predict the Movehub Rating Score:

# Train Test Split
set.seed(101)
split <- sample.split(res$Movehub.Rating, SplitRatio = 0.7)
train <- subset(res, split == TRUE)
test <- subset(res, split == FALSE)
# Function to plot the result of a given model and other information
plotModel <- function(mod) {
  # Create dataframe with prediction and real values
  mod.predictions <- predict(mod, test)
  mod.res <- cbind(mod.predictions, test$Movehub.Rating)
  colnames(mod.res) <- c('Predictions', 'True')
  mod.res <- as.data.frame(mod.res)
  
  # Make plots of residuals, etc.
  g1 <- ggplot(data = mod.res, aes(x = Predictions, y = True)) +
    geom_point() + xlim(50, 100) + ylim(50, 100) +
    geom_abline(intercept = 0, slope = 1, color = 'red')
  g2 <- ggplot(data = mod.res, aes(x = True - Predictions)) +
    geom_histogram(bins = 50)
  g3 <- ggplot(data = mod.res, aes(x = Predictions, y = True - Predictions)) +
    geom_point()
  
  # Arrange the plots
  grid.arrange(g1, g2, g3, nrow = 2, ncol = 2)
  
  # Calculate metrics
  mse <- mean((mod.res$True - mod.res$Predictions)^2)
  rmse <- mse^0.5
  SSE <- sum((mod.res$Predictions - mod.res$True)^2)
  SST <- sum((mean(test$Movehub.Rating) - mod.res$True)^2)
  R2 <- 1 - SSE / SST
  
  cat("MSE: %f RMSE : %f R^2 :%f", mse, rmse, R2)
}

-Linear model with all features:

# Applying linear Regression
linModel <- lm(Movehub.Rating ~ ., data = train)
summary(linModel)

Call:
lm(formula = Movehub.Rating ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7935 -1.4489 -0.2631  1.2780 14.4159 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           58.3390764  2.6577153  21.951  < 2e-16 ***
Cappuccino            -0.2624010  0.5396183  -0.486  0.62761    
Cinema                 0.0206004  0.0449341   0.458  0.64741    
Wine                   0.1021572  0.1294251   0.789  0.43140    
Gasoline               2.9019739  1.0225435   2.838  0.00529 ** 
Avg.Rent               0.0023670  0.0005704   4.149 6.06e-05 ***
Avg.Disposable.Income -0.0016842  0.0007950  -2.119  0.03607 *  
Purchase.Power         0.3244978  0.0296638  10.939  < 2e-16 ***
Health.Care            0.0313351  0.0203895   1.537  0.12682    
Pollution             -0.0040804  0.0118137  -0.345  0.73037    
Crime.Rating           0.0349714  0.0202210   1.729  0.08616 .  
LAT                   -0.0161790  0.0129305  -1.251  0.21315    
LONG                   0.0095956  0.0046706   2.054  0.04198 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.045 on 127 degrees of freedom
Multiple R-squared:  0.801, Adjusted R-squared:  0.7822 
F-statistic: 42.61 on 12 and 127 DF,  p-value: < 2.2e-16
# Plotting the graph 
plotModel(linModel)

MSE: 17.39767 RMSE: 4.171052 R^2: 0.6145516

-Linear model with stepwise regression:

From the summary of the linear model, we see that not all the features are significant, so we can try to improve the model by performing a stepwise selection of features in both directions.
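For reference, the criterion `stepAIC()` minimises for a linear model is AIC = n·log(RSS/n) + 2k, where k counts the fitted coefficients; this is what `extractAIC()` computes. A quick check on a built-in dataset:

```r
# Fit a small linear model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit))          # 3: intercept, wt, hp

# Manual AIC, up to the additive constant stepAIC() uses
manual_aic <- n * log(rss / n) + 2 * k
all.equal(manual_aic, extractAIC(fit)[2])
#> [1] TRUE
```

Each "- feature" row in the step tables below drops one term and reports the resulting AIC; the search stops when no single addition or removal lowers it further.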

# Linear model with stepwise regression
step <- stepAIC(linModel, direction = "both")
Start:  AIC=324.18
Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent + 
    Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution + 
    Crime.Rating + LAT + LONG

                        Df Sum of Sq    RSS    AIC
- Pollution              1      1.11 1179.0 322.31
- Cinema                 1      1.95 1179.9 322.41
- Cappuccino             1      2.19 1180.1 322.44
- Wine                   1      5.78 1183.7 322.87
- LAT                    1     14.52 1192.4 323.90
<none>                               1177.9 324.18
- Health.Care            1     21.91 1199.8 324.76
- Crime.Rating           1     27.74 1205.7 325.44
- LONG                   1     39.15 1217.1 326.76
- Avg.Disposable.Income  1     41.63 1219.5 327.04
- Gasoline               1     74.70 1252.6 330.79
- Avg.Rent               1    159.70 1337.6 339.98
- Purchase.Power         1   1109.89 2287.8 415.12

Step:  AIC=322.31
Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent + 
    Avg.Disposable.Income + Purchase.Power + Health.Care + Crime.Rating + 
    LAT + LONG

                        Df Sum of Sq    RSS    AIC
- Cappuccino             1      1.76 1180.8 320.52
- Cinema                 1      2.68 1181.7 320.63
- Wine                   1      5.99 1185.0 321.02
- LAT                    1     16.41 1195.4 322.25
<none>                               1179.0 322.31
- Health.Care            1     21.26 1200.3 322.81
- Crime.Rating           1     27.26 1206.3 323.51
+ Pollution              1      1.11 1177.9 324.18
- LONG                   1     38.71 1217.7 324.83
- Avg.Disposable.Income  1     43.59 1222.6 325.39
- Gasoline               1     90.60 1269.6 330.68
- Avg.Rent               1    166.71 1345.7 338.83
- Purchase.Power         1   1132.63 2311.7 414.57

Step:  AIC=320.52
Movehub.Rating ~ Cinema + Wine + Gasoline + Avg.Rent + Avg.Disposable.Income + 
    Purchase.Power + Health.Care + Crime.Rating + LAT + LONG

                        Df Sum of Sq    RSS    AIC
- Cinema                 1      2.92 1183.7 318.87
- Wine                   1      4.42 1185.2 319.04
<none>                               1180.8 320.52
- LAT                    1     19.59 1200.4 320.82
- Health.Care            1     21.29 1202.1 321.02
- Crime.Rating           1     26.81 1207.6 321.66
+ Cappuccino             1      1.76 1179.0 322.31
+ Pollution              1      0.67 1180.1 322.44
- LONG                   1     38.55 1219.3 323.02
- Avg.Disposable.Income  1     57.38 1238.2 325.16
- Gasoline               1     89.15 1269.9 328.71
- Avg.Rent               1    164.99 1345.8 336.83
- Purchase.Power         1   1161.98 2342.8 414.44

Step:  AIC=318.87
Movehub.Rating ~ Wine + Gasoline + Avg.Rent + Avg.Disposable.Income + 
    Purchase.Power + Health.Care + Crime.Rating + LAT + LONG

                        Df Sum of Sq    RSS    AIC
- Wine                   1      7.99 1191.7 317.81
<none>                               1183.7 318.87
- LAT                    1     18.29 1202.0 319.01
- Health.Care            1     20.92 1204.6 319.32
- Crime.Rating           1     27.58 1211.3 320.09
+ Cinema                 1      2.92 1180.8 320.52
+ Cappuccino             1      2.00 1181.7 320.63
+ Pollution              1      1.27 1182.4 320.72
- LONG                   1     40.95 1224.7 321.63
- Avg.Disposable.Income  1     55.73 1239.4 323.31
- Gasoline               1     88.80 1272.5 326.99
- Avg.Rent               1    162.06 1345.8 334.83
- Purchase.Power         1   1183.33 2367.0 413.88

Step:  AIC=317.81
Movehub.Rating ~ Gasoline + Avg.Rent + Avg.Disposable.Income + 
    Purchase.Power + Health.Care + Crime.Rating + LAT + LONG

                        Df Sum of Sq    RSS    AIC
<none>                               1191.7 317.81
- LAT                    1     20.02 1211.7 318.14
- Health.Care            1     21.71 1213.4 318.34
- Crime.Rating           1     23.06 1214.8 318.49
+ Wine                   1      7.99 1183.7 318.87
+ Cinema                 1      6.49 1185.2 319.04
+ Pollution              1      2.58 1189.1 319.51
+ Cappuccino             1      0.06 1191.6 319.80
- LONG                   1     43.24 1234.9 320.80
- Avg.Disposable.Income  1     48.09 1239.8 321.35
- Gasoline               1     84.26 1276.0 325.37
- Avg.Rent               1    188.27 1380.0 336.34
- Purchase.Power         1   1214.75 2406.4 414.20
step$anova
Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent + 
    Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution + 
    Crime.Rating + LAT + LONG

Final Model:
Movehub.Rating ~ Gasoline + Avg.Rent + Avg.Disposable.Income + 
    Purchase.Power + Health.Care + Crime.Rating + LAT + LONG

          Step Df Deviance Resid. Df Resid. Dev      AIC
1                                127   1177.913 324.1800
2  - Pollution  1 1.106496       128   1179.020 322.3115
3 - Cappuccino  1 1.758833       129   1180.779 320.5202
4     - Cinema  1 2.924060       130   1183.703 318.8665
5       - Wine  1 7.988275       131   1191.691 317.8081
finalModel <- lm(Movehub.Rating ~ Cappuccino + Gasoline + Avg.Rent + Purchase.Power + Health.Care, data = train)
summary(finalModel)

Call:
lm(formula = Movehub.Rating ~ Cappuccino + Gasoline + Avg.Rent + 
    Purchase.Power + Health.Care, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2675 -1.8157 -0.3025  1.5552 14.7603 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    63.4341660  1.3299976  47.695  < 2e-16 ***
Cappuccino     -0.6709330  0.4581963  -1.464   0.1455    
Gasoline        1.4177586  0.7911495   1.792   0.0754 .  
Avg.Rent        0.0022600  0.0004962   4.554 1.17e-05 ***
Purchase.Power  0.2552946  0.0171398  14.895  < 2e-16 ***
Health.Care     0.0296582  0.0206849   1.434   0.1540    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.194 on 134 degrees of freedom
Multiple R-squared:  0.7691,    Adjusted R-squared:  0.7605 
F-statistic: 89.26 on 5 and 134 DF,  p-value: < 2.2e-16
# Plotting the graph of residuals for the final model
plotModel(finalModel)

MSE: 15.75842 RMSE: 3.969687 R^2: 0.6508697

-GLM model:

# Fit a glm model using your data 
glm_model <- glm(Movehub.Rating ~ ., data = train, family = gaussian)
# Apply the plotModel function to the glm model
plotModel(glm_model)

MSE: 17.39767 RMSE: 4.171052 R^2: 0.6145516
# Comparison between all features and selected ones
anova(linModel, finalModel,glm_model)
Analysis of Variance Table

Model 1: Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent + 
    Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution + 
    Crime.Rating + LAT + LONG
Model 2: Movehub.Rating ~ Cappuccino + Gasoline + Avg.Rent + Purchase.Power + 
    Health.Care
Model 3: Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent + 
    Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution + 
    Crime.Rating + LAT + LONG
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    127 1177.9                                
2    134 1367.1 -7   -189.22 2.9144 0.007341 **
3    127 1177.9  7    189.22 2.9144 0.007341 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Comparing all the models above, lm and glm with all features did not achieve a better test-set R^2 than the stepwise-reduced linear model (0.6145516 vs. 0.6508697).

——————————————————————————————————-

THANK YOU!!

You can find the code and other data sets here: https://github.com/babluprasad70/R_Assignment